Binary recursive partitioning: Background, methods, and application to psychology

Authors

  • Edgar C. Merkle
  • Victoria A. Shaffer
Abstract

Binary recursive partitioning (BRP) is a computationally intensive statistical method that can be used in situations where linear models are often used. Instead of imposing many assumptions to arrive at a tractable statistical model, BRP simply seeks to accurately predict a response variable based on values of predictor variables. The method outputs a decision tree depicting the predictor variables that were related to the response variable, along with the nature of the variables’ relationships. No significance tests are involved, and the tree’s “goodness” is judged based on its predictive accuracy. In this paper, we describe BRP methods in a detailed manner and illustrate their use in psychological research. We also provide R code for carrying out the methods.

Binary recursive partitioning: Background, methods, and application to psychology

Binary recursive partitioning (BRP), also referred to as Classification and Regression Trees (CART), is a computationally intensive statistical method that is often attributed to Breiman, Friedman, Olshen, and Stone (1984). The method can be used in situations where one is interested in studying the relationships between a response variable and predictor variables, with “classification” referring to trees with categorical response and “regression” referring to trees with continuous response. The method has many desirable properties, some of the most notable being: (1) it is nonparametric (in the sense that no stochastic model is imposed on the data) and free of significance tests; (2) the predictor and response variables can be of all types (continuous, ordinal, categorical), with minimal change in the underlying algorithm and the resulting output; (3) missing data are handled without the need for imputation techniques; (4) it is invariant to monotonic transformations of the predictor variables; and (5) it is minimally impacted by outliers.
As discussed later in the paper, these attributes prove advantageous in some situations where linear models are suboptimal or where the data are exceedingly “messy.”

Modern BRP methods, including those developed by Breiman et al. (1984), are related to earlier procedures that supplement regression analyses. Many of these early procedures were developed in the social sciences (specifically, at the Institute for Social Research, University of Michigan). The most well-known procedure may be the Automatic Interaction Detection method developed by Morgan and Sonquist (1963; see also Sonquist, 1970). Automatic Interaction Detection (AID) was proposed as a way of identifying complex interactions in datasets with continuous response variables. The technique involves sequentially splitting the data into two groups based on values of predictor variables, such that the between-groups sum of squares is maximized at each split. A specific split is obtained by calculating the between-groups sum of squares for each possible split and choosing the split that yields the largest sum of squares. The original AID procedure was extended to handle categorical response variables (Morgan & Messenger, 1973) and to improve user accessibility (Sonquist, Baker, & Morgan, 1974), and similar procedures were developed for ordinal response variables (e.g., Bouroche & Tennenhaus, 1971). Fielding (1977) presents an excellent overview of these early procedures and their use in practice. A general difficulty was that the results were unstable. For example, Sonquist et al. (1974) state: “A warning to potential users of this program: data sets with a thousand cases or more are necessary; otherwise the power of the search processes must be restricted drastically or those processes will carry one into a never-never land of idiosyncratic results” (p. 3).
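
The AID-style split search described above (compute the between-groups sum of squares for each candidate split and keep the largest) can be sketched for a single continuous predictor as follows. This is a minimal illustration and not the authors' R code: the function names `between_groups_ss` and `best_split`, and the choice of midpoints between adjacent sorted predictor values as candidate thresholds, are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def between_groups_ss(left: List[float], right: List[float]) -> float:
    """Between-groups sum of squares for a two-group partition of the response."""
    n_left, n_right = len(left), len(right)
    grand_mean = (sum(left) + sum(right)) / (n_left + n_right)
    mean_left = sum(left) / n_left
    mean_right = sum(right) / n_right
    return (n_left * (mean_left - grand_mean) ** 2
            + n_right * (mean_right - grand_mean) ** 2)


def best_split(x: List[float], y: List[float]) -> Optional[Tuple[float, float]]:
    """Return (threshold, between-groups SS) for the best split 'x <= threshold'.

    Candidate thresholds are midpoints between adjacent distinct predictor
    values (an assumption of this sketch). Returns None if no split exists.
    """
    pairs = sorted(zip(x, y))  # order cases by predictor value
    best: Optional[Tuple[float, float]] = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # tied predictor values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y_val for _, y_val in pairs[:i]]
        right = [y_val for _, y_val in pairs[i:]]
        ss = between_groups_ss(left, right)
        if best is None or ss > best[1]:
            best = (threshold, ss)
    return best
```

For example, with `x = [1, 2, 3, 10, 11, 12]` and `y = [5, 6, 5, 20, 21, 19]`, `best_split(x, y)` selects the threshold 6.5, which separates the three low responses from the three high ones. With several predictors, the same search would be repeated for each predictor and the overall best split kept; applying the procedure again within each resulting group gives the sequential (recursive) splitting.
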
Modern BRP methods employ a variety of strategies to make results more stable, or at least to give the user more information about the procedures’ general predictive abilities. While there is a body of research devoted to modern BRP (Ripley, 1996, and Zhang & Singer, 1999, present detailed reviews), the literature is largely either: (1) very technical, or (2) application-based and glossing over details of the procedure. Furthermore, while there have been some recent applications and development of BRP in psychology (e.g., Dusseldorp & Meulman, 2004), we have found BRP to be relatively unknown in the field. This is unfortunate because the procedure has roots in the social sciences and could be useful to many research endeavors. BRP is also a “gateway method,” in the sense that variants of BRP drive more advanced methods that are excellent tools for prediction. These advanced methods are briefly described at the end of the tutorial.

The purpose of this paper is to provide an overview of BRP, including information about each step of the procedure and about carrying out the procedure in R. In the following pages, we first provide an introductory example and then discuss specific details about BRP. The description draws from Breiman et al. (1984), as well as from our own experiences with coding a BRP program from scratch. Next, we describe the application of BRP to a psychology study involving the decisions of mock jurors in malpractice lawsuits. Finally, we compare BRP with regression and discuss some general information regarding the use of BRP in practice.

Similar resources

Pólya Urn Models and Connections to Random Trees: A Review

This paper reviews Pólya urn models and their connection to random trees. Basic results are presented, together with proofs that underlie the historical evolution of the accompanying thought process. Extensions and generalizations are given according to chronology: • Pólya-Eggenberger’s urn • Bernard Friedman’s urn • Generalized Pólya urns • Extended urn schemes • Invertible urn schemes ...

On the first variable Zagreb index

The first variable Zagreb index of graph $G$ is defined as \begin{eqnarray*} M_{1,\lambda}(G)=\sum_{v\in V(G)}d(v)^{2\lambda}, \end{eqnarray*} where $\lambda$ is a real number and $d(v)$ is the degree of vertex $v$. In this paper, some upper and lower bounds for the distribution function and expected value of this index in random increasing trees (rec...

Multiclass Cancer Classification by Using Fuzzy Support Vector Machine and Binary Decision Tree With Gene Selection

We investigate the problems of multiclass cancer classification with gene selection from gene expression data. Two different constructed multiclass classifiers with gene selection are proposed, which are fuzzy support vector machine (FSVM) with gene selection and binary classification tree based on SVM with gene selection. Using F test and recursive feature elimination based on SVM as gene sele...

On the Complexity of Finding the Chromatic Number of a Recursive Graph I: The Bounded Case

We classify functions in recursive graph theory in terms of how many queries to K (or ∅ or ∅) are required to compute them. We show that (1) binary search is optimal (in terms of the number of queries to K) for finding the chromatic number of a recursive graph and that no set of Turing degree less than 0 will suffice, (2) determining if a recursive graph has a finite chromatic number i...

Binary Random Sequences Obtained From Decimal Sequences

D sequences [1-15] are perhaps the simplest family of random sequences that subsumes other families such as shift register sequences [16]. In their ordinary form, d sequences are not computationally complex [2], but they can be used in a recursive form [8] that is much stronger from a complexity point of view. The basic method of generating binary d-sequences is given in [1]. The autoco...

Limit laws for functions of fringe trees for binary search trees and random recursive trees

We prove general limit theorems for sums of functions of subtrees of (random) binary search trees and random recursive trees. The proofs use a new version of a representation by Devroye, and Stein’s method for both normal and Poisson approximation together with certain couplings. As a consequence, we give simple new proofs of the fact that the number of fringe trees of size k = kn in the binary...

Journal title:

Volume   Issue

Pages  -

Publication date: 2010